1 



Status of the CP-PACS Project 

A. Ukawa a 

institute of Physics, University of Tsukuba, Tsukuba, Ibaraki 305, Japan 

The CP-PACS Project, which started in April 1992, is a five-year plan to develop a massively parallel computer 
for carrying out research in computational physics with primary emphasis on lattice QCD. This article describes 
the architectural design of the CP-PACS computer, the entire computing system including the front end and mass 
storage, and results of benchmarks for the expected performance for lattice QCD applications. 



o 

Q 

00 
<N 

o 

(N 

1—1 
Oh 

43 



1. THE CP-PACS PROJECT[l] 

The project name CP-PACS stands for Com- 
putational Physics-Parallel Array Computer Sys- 
tem. The name was chosen to reflect the two 
phases of the project - development of a massively 
parallel computer optimized for physics problems 
describable in terms of space-time fields, and sub- 
sequent research with it in several key areas of 
computational physics with primary emphasis on 
lattice QCD. 

The preparation for the project was initiated 
in the early summer of 1990. The proposal, 
submitted to the Ministry of Education, Science 
and Culture of the Japanese Government, was 
approved in March 1991. The project formally 
started in April 1992 with the funding of 1.5 bil- 
lion Yen, to be spread over a five-year period be- 
tween fiscal 1992 and 1997, allocated from a spe- 
cial category of the Grant-in- Aid of the Ministry 
of Education, Science and Culture for promoting 
innovative and fundamental research in science. 
Concomitant with the start of the project, Center 
for Computational Physics was founded at Uni- 
versity of Tsukuba in April 1992 in order to serve 
as a base for a collaborative effort between physi- 
cists and computer scientists for the development 
of the CP-PACS computer and its utilization for 
research in computational physics. 

The project currently consists of 28 members 
as listed in Table 1. This represents a significant 
expansion over the predecessors of the project, 
the PAX Project(1977-1986) of T. Hoshino[2] and 
the QCDPAX Project(1987-1989) of Y. Iwasaki 
and T. Hoshino[3], which, carried out by a hand- 
ful of people, pioneered the development and ap- 



plication of parallel computers to scientific cal- 
culations in Japan. As Table 1 clearly shows, 
the CP-PACS Project is a multi-disciplinary ef- 
fort toward the advancement of computational 
physics encompassing not only several branches 
of physics but also computer science to develop 
parallel computers best suited for such applica- 
tions. The project is headed by Y. Iwasaki and 
the development of the computer system is led by 
K. Nakazawa. 

Development of a massively parallel computer 
requires advanced semiconductor technology. We 
discussed the aim of the project with a num- 
ber of manufacturers and invited proposals in the 
period of 1991-1992. We have selected Hitachi 
Ltd. for the manufacturing of the CP-PACS com- 
puter through a formal bidding process in the 
early summer of 1992. The company has since 
been working with us in a close collaboration both 
on the hardware and software development of the 
CP-PACS computer. 

2. CP-PACS COMPUTER 
2.1. Design parameters 

The target specification of the CP-PACS com- 
puter is summarized in Table 2. The basic strat- 
egy we have adopted for the design is the usage of 
a fast RISC micro-processor for high arithmetic 
performance at each node and a linking of nodes 
with a flexible network so as to be able to han- 
dle a wide variety of problems in computational 
physics. The unique features of the CP-PACS 
computer reflecting these goals are represented 
by the special node processor architecture called 
pseudo vector processor based on slide-windowed 
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Table 1 

CP-PACS Project members 



hardware 


software 


particle physics 


astrophysics 


solid state physics 


K. Nakazawa" 


I. Nakata" 


Y. Iwasaki s 


M. Miyama' 1 


S. Miyashita' 


H. Nakamura 


Y. Yamashita° 


A. Ukawa 3 


T. Nakamura ! 


M. Imada™ 


T. Boku a 


Y. Oyanagi c 


K. Kanaya' 1 


M. Umemura' 1 




T. Hoshino 6 


T. Kawai d 


S. Aokis 


Y. Nakamoto 3 




T. Shirakawa 6 


M. Mori 6 


T. Yoshie 71 






K. Wada a 


Y. Watase-f 
S. Ichii^ 


M. Fukugita* 
M. Okawa-? 
N. Ishizuka 9 
H. KawaP 







" Institute of Information Sciences and Electronics, University of Tsukuba 

6 Institute of Engineering Mechanics, University of Tsukuba 

c Department of Information Science, University of Tokyo 

d Department of Physics, Keio University 

e Department of Engineering, University of Tokyo 

f Data Handling Division, KEK 

3 Institute of Physics, University of Tsukuba 

h Center for Computational Physics, University of Tsukuba 

* Yukawa Institute for Theoretical Physics, Kyoto University 

1 Theory Division, KEK 

k National Observatory 

' Institute of Humanities and Environments, Kyoto University 
m Institute of Solid State Physics, University of Tokyo 



registers (PVP-SW)[4] and the choice of a three- 
dimensional crossbar network. 

The computer is scalable up to 2048 nodes. In 
order to exploit this potential, a request is being 
made for an additional funding for a grade-up of 
the CP-PACS computer to the Ministry of Edu- 
cation, Science and Culture . 

2.2. Node processor 
2.2.1. PVP-SW[4] 

A wide class of numerical calculations in scien- 
tific problems is well suited for vector processing. 
Vector enhancement of RISC micro-processors, 
however, faces a number of engineering problems. 
One of the problems is that of a large memory la- 
tency, namely the time needed to fetch data from 
main memory to registers is much larger than the 
processor machine cycle time. The bandwidth be- 
tween memory and processor also has to be suffi- 
ciently high so that the processor is kept supplied 
with new data at every machine cycle. Finally 
the number of floating point registers should be 
large in order to handle complex calculations in 
scientific applications. 



It is worth noting that a cache memory widely 
used in commercial RISC micro-processors is in- 
adequate to overcome these problems for large- 
scale scientific applications since data size easily 
exceeds the cache memory capacity and that data 
generally have a low temporal locality. 

The slide- windowed registers with preload fea- 
tures^] resolve the problem of large memory la- 
tency and the necessity of a large number of float- 
ing point registers while keeping upward compati- 
bility with commercial RISC micro-processors, in 
which usual arithmetic instructions can specify 
only a limited number of registers such as 32. 

A schematic illustration of slide- windowed reg- 
isters is given in Fig. 1. The number of phys- 
ical floating point registers is some large num- 
ber to > 32, while the number of logical registers 
specifiable by usual instructions is 32. The logi- 
cal registers are divided into g global registers and 
32— g local registers. The global registers are fixed 
at the top of the physical registers. The local 
registers, on the other hand, can slide along the 
rest of the physical registers with the slide length 
taking any value specifiable in user application 
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Table 2 

Target specification of the CP-PACS computer 



g-1 a 



physical registers 



peak speed 
main memory 
parallel architect 

number of nodes 

node processor 
architecture 
#FP registers 
clock cycle 
1st level cache 
2nd level cache 

network 
node array 
through-put 
latency 

distributed disks 
total capacity 

software 
OS 

language 



300Gflops(64 bit data) 
64GB 
sue MIMD with 

distributed memory 
1024 

HP PA-RISC1.1+PVP-SW 
128 

150MHz 

16KB(I)+16KB(D) 
512KB 
3-d crossbar 
8 x 17 x 8* 
>150MB/sec 
<7/isec 

3.5" RAID-5 disk 
538GB 

UNIX micro kernel 
FORTRAN, C, assembler 



"including nodes for disk I/O 

programs. The totality of the local registers at 
a particular position and the global registers is 
called a window. 

Suppose that the logical registers are at the 
window i. While carrying out computations us- 
ing the registers in this window, the processor 
can issue preload instructions which fetch data 
from memory to registers in any forward window 
j > i + 1. The processor can also issue poststore 
instructions which store data in any previous win- 
dow k < i — 1 to memory. Instructions that follow 
a preload or a poststore do not wait completion of 
the preload and poststore instructions. For m suf- 
ficiently large, one can clearly arrange the value 
of j so that the data already reside in registers 
when the window is shifted to j, thereby effec- 
tively reducing the memory latency to null. 

A high bandwidth between memory and reg- 
isters is an independent requirement. A con- 
ventional method for solving this problem is to 
pipeline the memory with multiple interleaved 
banks, which we adopt for our processor. 

Let us illustrate how pseudo vector process- 
ing works with the slide-windowed registers by 



window 
window i-1 
window i 



global 



logical(slide windowed) registers 
31 



preload data 
from memory 
to forward window 



slide length 



window i+1 global 



[used for 
global data 



poststore data 
to memory from 
previous window 



current window 
executing computations 



Figure 1. Structure of slide- windowed registers. 



loop count 


i -2 i-1 


i 


i+1 




MC instruction 


1 Preload 


Y(i-1) 








2 Add 


aX(i -2)+Y(i -2) 








3 Preload 

4 Mult 


aX(M) 


X(i) 




previous 
window 


5 Poststore 


Z(i -2) 








6 Window switch 










7 Preload 




Y(i) 






8 Add 

9 Preload 

10 Mult 


aX(i-1)+Y(i 


-1) 

aX(i) 


X(i+1) 


current 
window 


1 1 Poststore 


Z(i -1) 








12 Window switch 










13 Preload 






Y(i+1) 




14 Add 

15 Preload 

16 Mult 




aX(i)+Y(i) 


aX(i+1 


next 
window 


1 7 Poststore 

18 Window switch 




Z(i) 







Figure 2. A possible instruction scheduling for 
DAXPY loop with PVP-SW. 

a simple example of the DAXPY loop Z{ = a X 
%i + Vi,i = l,'''! 71 - For each loop cycle, we 
need to load X{ and yi, carry out a multiplication 
(a,Xi) — > a x Xi and an addition (a x Xi,yi) — ► 
a x Xi + yi, and finally store Z{. 

The idea to vectorize these operations is to in- 
terleave instructions from neighboring loop cycles 
so that the processor machine cycles are filled up 
with instructions. This can be carried out by an 
application of the software pipelining technique 
called module scheduling[5,4], which results in the 
scheduling shown in Fig. 2. Here we assumed a 
memory latency of 7 machine cycles and a latency 
of 3 machine cycles for the completion of an addi- 
tion or a multiplication. Branch instructions are 
omitted for simplicity as well as the superscalar 
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features. 

We observe that there are two preloads, one 
poststore, one multiplication and one addition for 
each window as required. However, the instruc- 
tions in a window generally refer to different loop 
cycles. The memory latency is completely hid- 
den by the use of preloading to the next win- 
dow. Compared to 15 machine cycles needed for 
a scalar code, a loop cycle is reduced to 5 + 1 ma- 
chine cycles with the last machine cycle necessary 
for the window switch instruction. 

To write an executable assembler code we still 
need to decide the slide length and allocate reg- 
isters. The minimum value of slide length equals 
the number of preloads, which equals 2 for the 
present case. The constant a common to all the 
loop cycles is stored in a global register. Alloca- 
tion of Xi and yi to local registers can be worked 
out by a coloring technique [6]; we refer to Ref. [4] 
for details . 

2.2.2. Implementation 

We employ the Hewlett-Packard PA-RISC 1.1 
architecture for the base architecture of the node 
processor. The instruction set is enhanced to in- 
clude preload, poststore and window switch for in- 
corporating the PVP-SW. The peak speed of the 
processor is 300Mflops for 64 bit data with the 
machine cycle of 150MHz. 

A detailed examination of the performance of 
the enhanced architecture for the Livermore ker- 
nels^] has shown that a memory latency of up to 
70 machine cycles can be hidden if the number 
of physical registers m exceeds 96. Benchmarks 
for the core of the minimal residual solver for the 
Wilson quark action yielded a somewhat larger 
estimate of m « 110 (see Sec. 4). We have cho- 
sen to = 128 for the number of slide-windowed 
physical registers. 

The processor has 64Mbyte of DP AM for main 
memory, which is pipelined with multiple inter- 
leaved memory banks. The bandwidth is de- 
signed to be roughly one 8byte word for each ma- 
chine cycle. 

The processor also has two levels of cache mem- 
ory; the on-chip first level cache for data and 
instructions separately, each with a capacity of 
16Kbyte, and an off-chip second level cache of at 
least 512Kbyte. 




PU (Processing Unit) 
IOU (IO Unit) 
Crossbar Switch 
• EX (Exchanger) 
Disk 

Figure 3. Schematic diagram of CP-PACS. 

2.3. Network 

The network of the CP-PACS computer is 
best illustrated by a schematic diagram shown 
in Fig. 3. The 1024 processors are arranged in 
a three-dimensional 8 x 16 x 8 array. To con- 
nect the processors together, a number of cross- 
bar switches (shaded bars) are placed in x,y and 
z directions. The crossbars for different directions 
are connected at each crossing point by a router, 
called an exchanger (shaded circles), which is a 
4x4 crossbar itself. A processing unit (PU) 
consisting of a processor with its memory(shaded 
squares) is attached to each exchanger. 

Inter-node communication is made by message 
passing. Transfer of data within the network 
proceeds via wormhole routing through the ex- 
changers. The direction of routing is fixed to 
x — ► y — ► z to avoid deadlocks. The network also 
supports broadcast and barrier synchronization. 

This represents a very flexible architecture al- 
lowing data to be sent from a processor to any 
other processor in at most three steps. The net- 
work also allows a hardware barrier to be set up 
across several two-dimensional planes so that the 
processor array can be divided into a set of inde- 
pendent subarrays. For lattice QCD applications, 
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Figure 4. Schematic diagram of IOU and PU. 

a crucial advantage of the architecture is that vir- 
tually any lattice size can be simulated, subdivid- 
ing the processor array if necessary. Hence finite 
lattice size effects, which have proven very im- 
portant in recent lattice QCD studies, could be 
examined in detail. 

Transfer of data is controlled by a UNIX micro 
kernel which resides in each processor. There is a 
fast mode of communication in which data in the 
user application area of memory are directly sent 
to the network interface without buffering to the 
system memory area, and are directly written into 
the memory of the receiving processor also with- 
out system buffering. The fast mode supports a 
block-strided transfer, i.e., a set of data regularly 
spaced in memory addresses can be sent by a sin- 
gle command. 

The bandwidth of each crossbar is at least 
150Mbyte/sec. The latency, namely the setup 
time for sending and receiving data, is less than 
7/isec for the fast mode combining those due to 
hardware and software. 

2.4. Distributed disks 

The CP-PACS computer is designed to have 
over 500Gbyte of disk space for storing interme- 
diate results. Important practical problems with 
such a large capacity is the reduction of I/O time 
between processor memory and disks, and the 
guard against loss of data due to disk failures. 

These problems are solved by distributing disks 
across the x — z plane of the processor array and 



employing the RAID-5 technology. As shown in 
Fig. 3, an I/O unit (IOU) is attached at the end 
of each crossbar in the y direction, to which is 
connected a 3.5 inch RAID-5 disk with a capacity 
of 8.4Gbyte through a SCSI-II bus. Thus the CP- 
PACS computer has 8 x 8 = 64 IOU's with a total 
peak throughput of 64 x 10 = 640Mbyte/sec and 
a total disk space of 538Gbyte. To accommodate 
the I/O units, the crossbars in the y direction 
are enlarged to 17 X 17 crossbars. A schematic 
diagram of IOU is given in comparison to that of 
PU in Fig. 4. 

2.5. Software development 

2.5.1. Operating system 

The CP-PACS computer is run on the UNIX 
operating system. Each node processor, however, 
carries only a micro kernel of UNIX in order to 
save memory for user application programs and 
to avoid a degradation of performance. The ker- 
nel handles memory control, inter-node commu- 
nications, process scheduling, interruptions and 
I/O. Full UNIX operating system is implemented 
on one of the IOU, named SIOU, which controls 
parallel processes through the network. 

The operating system has several new functions 
added for parallel processing; software partition- 
ing of the processor array so that independent 
programs may be run on different partitions, and 
generation of processes over a user-specified num- 
ber of nodes to execute a parallel program. 

The file system is logically structured to form 
a single tree for the entire CP-PACS computer. 
A logical single file, however, may be physically 
divided across the disks so as to reduce I/O over- 
heads. These divisions and allocations are made 
automatically by the operating system. 

2.5.2. Programming environment 

The programming language of the CP-PACS 
computer is FORTRAN, C and assembly lan- 
guage. An assembler code can be included as a 
subroutine in a FORTRAN or C code in order 
to maximize the performance. FORTRAN and 
C compiler are being developed incorporating the 
PVP-SW enhancement. The technique of modulo 
scheduling and register coloring can be employed 
for the compiler algorithm (see Sec. 2.2.1), and a 
prototype of a FORTRAN compiler has already 
been written. 
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Figure 5. Plan for the computing system at the 
Center for Computational Physics. 

3. FRONT END AND MASS STORAGE 

We show in Fig. 5 the plan for the computing 
system at the Center for Computational Physics. 
In order to store data produced by the CP-PACS 
computer, two levels of mass storage are consid- 
ered: 300Gbyte of disks and up to 50Tbyte of 
cartridge magnetic tape library. Transfer of data 
is mediated by a main-frame host computer us- 
ing a high speed link with the peak throughput of 
lOOMbyte/sec. The host also serves as a front end 
for controlling the CP-PACS computer and its job 
scheduling. Program development and data anal- 
yses are carried out on the host and work stations 
connected by an FDDI and Ethernet local area 
network, which also serves as a connection to the 
outside of the Center through the University of 
Tsukuba Campus LAN. 

4. LATTICE QCD BENCHMARKS 
4.1. Wilson minimal residual solver 

Let us describe the expected performance of the 
CP-PACS computer for lattice QCD calculations 
based on benchmarks assuming 2048 processors. 

The core of the minimal residual solver for the 
Wilson quark action may be symbolically writ- 
ten as H = G — k * U * G where G is the 
solution vector, U the link matrices and n the 
hopping parameter. For each Dirac index, the 
calculation of V = U * G requires 30 loads, 6 
stores, 36 additions and 36 multiplications. Ide- 
ally these operations can be executed in 36 ma- 
chine cycles with the superscalar architecture of 



PA-RISC 1.1. The calculation of G - k * V, 
on the other hand, is less balanced requiring 
56 machine cycles to accommodate 40 loads, 8 
stores, 56 additions and 16 multiplications. Thus 
the theoretical maximum of the efficiency de- 
fined by ^operations/ (2 x ^machine cycle) is 
(36 + 36 + 56 + 16)/(2 x (36 + 56)) = 82%. 

In collaboration with Hitachi Ltd., we have de- 
veloped a hand-optimized assembler code for this 
part of the solver including the PVP-SW features. 
The code uses a slide length of 10 and preloading 
to a window 6 times forward to the current win- 
dow. The efficiency turned out to be 75% for an 
infinite lattice and 74% for a 64 4 lattice divided 
into 8 X 4 X 4 X 64 sublattices on an 8 X 16 X 16 pro- 
cessor array. The degradation of efficiency stems 
from window switch instructions and address cal- 
culations. 

The efficiency above does not take into account 
the effect of memory bank conflict. This effect 
was estimated by a simulation program which in- 
corporates the actual structure of memory. It was 
found that the loss of efficiency is less than 1% if 
the number of physical registers is 110 or more, 
which is satisfied with 128 registers adopted for 
the processor. 

The communication overhead was estimated 
assuming a bandwidth of 150Mbyte/sec and a 
latency of 7/isec for the crossbar network. The 
efficiency decreases from 74% to 57% for a 64 4 
lattice. The loss primarily comes from the band- 
width, which we hope to improve for the actual 
machine. 

A similar estimate for the Kogut-Susskind 
quark action yielded an efficiency of 81% for a 
single processor and 67% including communica- 
tions for a 64 4 lattice. 

For the Wilson minimal residual solver with 
a red/black preconditioning we estimate the ef- 
ficiency including communications to be 63%, of 
which 82% are for floating point operations and 
18% for communications. The actual execution 
time for a single iteration of the solver is expected 
to take 0.075sec for a 64 4 lattice. 

4.2. CPU time estimates for physics runs 

Representative estimates of CPU time for lat- 
tice QCD runs based on the benchmarks of 
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Table 3 

Estimate of CPU time for physics runs on the 
CP-PACS computer assuming 2048 processors. 



(a) quenched Wilson spectroscopy 



assumptions: 






size 


64 4 






0.9 - 0.4 in steps of 0.1 


# conf. 


500 




CPU time (days) 






gauge sweep 


22 




quark propagator 




and observables 


36 




total 


58 




(b) full QCD configuration generation 


assumptions: 






HMC(Wilson) or 


hybrid R(KS) 




red/black MR(Wilson) or CG(KS) 




# trajectories = 


2000 




CPU time (days) 






size action 








0.6 0.5 0.4 


0.3 


32 3 x 64 W 


65 109 187 




KS 


6 12 34 




48 3 x 12 W 


64 111 


210 


KS 


7 22 


63 



Sec. 4.1 are listed in Table 3. For quenched QCD 
we expect to carry out exhaustive measurements 
with the quark mass ranging from heavy to light 
for a lattice size of typically 64 4 . Simulations car- 
ried out for three values of lattice spacing to con- 
trol the continuum limit should take about half a 
year of CPU time. 

Full QCD simulations are much more time con- 
suming. The situation is particularly severe for 
the Wilson quark action. Studies are currently 
conducted to search for an efficient paralleliza- 
tion of preconditioning methods such as the in- 
complete LU decomposition which is known to 
reduce the CPU time by a factor of five or more 
for vector computers. 

5. SUMMARY 

In this article we have described the CP-PACS 
Project and the architecture of the planned par- 
allel computer. At present the hardware design 



is near completion and checks by simulation are 
well underway. The physical packaging design 
is also being finalized. A prototype, mainly for 
the purpose of checking the crossbar network, 
has been built, and tests of software have been 
started. Manufacturing of components will start 
early next year, and assembling of the 1024 node 
CP-PACS computer is scheduled for completion 
by the spring of 1996. The start of physics com- 
putations is expected by the spring of 1997. 
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